The number of electronic documents on the Internet is growing very quickly, so identifying the subjects of documents effectively has become an important issue. Past research has focused on the behavior of nouns in documents. Although subjects are composed of nouns, the constituents that determine which nouns are subjects are not nouns alone. Under the assumption that texts are well-organized and event-driven, nouns and verbs together contribute to the process of subject identification. This paper considers four factors, 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance, and proposes a model for identifying the subjects of textual documents. Preliminary experiments show that the performance of the proposed model is close to that of human beings.
This paper proposes a corpus-based language model for topic identification. We analyze the association of noun-noun and noun-verb pairs in the LOB Corpus. The word association norms are based on three factors: 1) word importance, 2) pair co-occurrence, and 3) distance. They are trained at the paragraph level for noun-noun pairs and at the sentence level for noun-verb pairs. Under the topic coherence postulate, the nouns that have the strongest connectivity with the other nouns and verbs in the discourse form the preferred topic set. This collocational semantics is then used to identify the topics of paragraphs and to discuss the topic shift phenomenon among paragraphs.
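The connectivity idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so everything here is an assumption made for illustration: the function names, the tags "N" and "V", and the choice to score a pair as the product of the two importance weights divided by the token distance. Co-occurrence enters only through pairs being summed each time they appear together in the paragraph.

```python
from collections import defaultdict


def connectivity_scores(tokens, importance):
    """Sum a pair score for each noun against every other noun or verb.

    tokens: list of (word, tag) in text order, tag is "N" or "V" (assumed tags).
    importance: dict mapping word -> weight; missing words default to 1.0.
    The pair score importance * importance / distance is a hypothetical
    stand-in for the trained association norms described in the abstract.
    """
    scores = defaultdict(float)
    for i, (w1, t1) in enumerate(tokens):
        if t1 != "N":
            continue  # only nouns are candidate topics
        for j, (w2, _t2) in enumerate(tokens):
            if i == j or w1 == w2:
                continue  # skip the token itself and repeats of the same word
            distance = abs(i - j)
            weight = importance.get(w1, 1.0) * importance.get(w2, 1.0)
            scores[w1] += weight / distance
    return dict(scores)


def preferred_topics(tokens, importance, k=1):
    """Return the k nouns with the strongest total connectivity."""
    scores = connectivity_scores(tokens, importance)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

In this sketch a noun that recurs and sits close to many other nouns and verbs accumulates the highest score, which mirrors the claim that the most strongly connected nouns form the preferred topic set.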
Acquiring noun phrases from running text is useful for many applications, such as word grouping and terminology indexing. The approaches reported in the literature adopt either a purely probabilistic approach or a purely rule-based noun phrase grammar to tackle this problem. In this paper, we apply a probabilistic chunker to decide the implicit boundaries of constituents and use linguistic knowledge to extract the noun phrases with a finite-state mechanism. The test texts are taken from the SUSANNE Corpus, and the results are evaluated automatically by comparing them with the parse field of the SUSANNE Corpus. The results of this preliminary experiment are encouraging.
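The finite-state extraction step can be sketched as follows, assuming the chunker has already split the text into chunks of tagged words. The pattern used here (optional determiner, any number of adjectives, one or more nouns), the simplified tags "DET", "ADJ", and "NOUN", and the function names are all hypothetical; the abstract does not specify the actual rules or tagset.

```python
def extract_noun_phrases(chunk):
    """Match DET? ADJ* NOUN+ inside one chunk with a small state machine.

    chunk: list of (word, tag) pairs using the assumed tags DET, ADJ, NOUN;
    any other tag ends the current candidate phrase.
    """
    phrases = []
    current = []
    state = "START"  # states: START, DET, ADJ, NOUN (NOUN is accepting)

    def flush():
        nonlocal current, state
        if state == "NOUN":  # emit only candidates that ended on a noun
            phrases.append(" ".join(current))
        current = []
        state = "START"

    for word, tag in chunk:
        if tag == "DET":
            flush()  # a determiner always starts a new candidate
            current = [word]
            state = "DET"
        elif tag == "ADJ":
            if state == "NOUN":
                flush()  # an adjective after a noun starts a new candidate
            current.append(word)
            state = "ADJ"
        elif tag == "NOUN":
            current.append(word)
            state = "NOUN"
        else:
            flush()
    flush()
    return phrases


def extract_from_chunks(chunks):
    """Apply the matcher to each chunk so phrases never cross chunk boundaries."""
    return [np for chunk in chunks for np in extract_noun_phrases(chunk)]
```

Running the matcher chunk by chunk is the design point: the chunker's boundaries constrain where a noun phrase may start and end, and the hand-written pattern only operates inside them.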